Lab 3 - The Recipes package
SOLUTIONS
Packages
We will use the following packages in this lab.
library(magrittr)   # the pipe
library(tidyverse)  # for data wrangling + visualization
library(tidymodels) # for modeling
library(gt)         # for pretty tables
theme_set(theme_bw(base_size = 12))
boston_cocktails <- readr::read_csv('../data/boston_cocktails.csv', show_col_types = FALSE)
Data: The Boston Cocktail Recipes
The Boston Cocktail Recipes dataset appeared in a TidyTuesday posting. TidyTuesday is a weekly data project in R.
The dataset is derived from the Mr. Boston Bartender’s Guide, together with a dataset that was web-scraped as part of a hackathon.
This dataset contains the following information for each cocktail:
| variable | class | description |
|---|---|---|
| name | character | Name of cocktail |
| category | character | Category of cocktail |
| row_id | integer | Drink identifier |
| ingredient_number | integer | Ingredient number |
| ingredient | character | Ingredient |
| measure | character | Measurement/volume of ingredient |
| measure_number | real | Measure as a number |
Exercises
Exercise 1
First use skimr::skim and DataExplorer::introduce to assess the quality of the data set.
Next prepare a summary. What is the median measure number across cocktail recipes?
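The steps in this exercise can be sketched on a small stand-in table. This is a minimal sketch, not the lab solution: `toy_cocktails` and its rows are hypothetical, and the two data-quality calls run only if skimr and DataExplorer are installed.

```r
library(dplyr)

# Hypothetical toy rows standing in for boston_cocktails (not real recipes)
toy_cocktails <- tibble(
  name = c("Toy Sour", "Toy Sour", "Toy Martini", "Toy Martini", "Toy Daiquiri"),
  ingredient = c("Gin", "Lemon Juice", "Gin", "Dry Vermouth", "Light Rum"),
  measure_number = c(2, 0.5, 1.5, 0.75, 2)
)

# Data-quality overview; each call is skipped if its package is missing
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(toy_cocktails))
}
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  print(DataExplorer::introduce(toy_cocktails))
}

# Summary: row count and median of the numeric measure
measure_summary <- toy_cocktails %>%
  summarise(
    n_rows = n(),
    median_measure = median(measure_number, na.rm = TRUE)
  )
measure_summary
```

Replacing `toy_cocktails` with `boston_cocktails` should give the corresponding summary for the lab data.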
Exercise 2
From the boston_cocktails dataset select the name, category, ingredient, and measure_number columns and then pivot the table to create a column for each ingredient. Fill any missing values with the number zero.
Since the names of the new columns may contain spaces, clean them using janitor::clean_names(). Finally, drop any rows with NA values and save the new dataset in a variable.
How much gin is in the cocktail called Leap Frog Highball?
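The select, pivot, clean, and drop sequence might look like the following. This is a sketch under stated assumptions: the toy rows and drink names are hypothetical, and each drink lists every ingredient at most once (repeated ingredients within a drink would need extra handling before pivoting).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows standing in for boston_cocktails (not real recipes)
toy_cocktails <- tibble(
  name = c("Toy Highball", "Toy Highball", "Toy Martini", "Toy Martini"),
  category = rep("Cocktail Classics", 4),
  ingredient = c("Gin", "Lemon Juice", "Gin", "Dry Vermouth"),
  measure_number = c(2, 0.5, 1.5, 0.75)
)

toy_wide <- toy_cocktails %>%
  select(name, category, ingredient, measure_number) %>%
  # one column per ingredient; absent ingredients are filled with 0
  pivot_wider(
    names_from = ingredient,
    values_from = measure_number,
    values_fill = 0
  ) %>%
  # "Lemon Juice" becomes lemon_juice, and so on
  janitor::clean_names() %>%
  drop_na()

# Amount of gin in one named drink
gin_in_highball <- toy_wide %>%
  filter(name == "Toy Highball") %>%
  pull(gin)
gin_in_highball
```

The same filter on the cocktail name, applied to the wide table built from the lab data, answers the question above.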
Exercise 3
Prepare a recipes::recipe object without a target, but give name and category the ‘id’ role. Add steps to normalize the predictors and perform PCA. Finally prep the data and save it in a variable.
How many predictor variables are prepped by the recipe?
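A recipe of this shape can be sketched as below. The table `toy_wide` is a hypothetical stand-in for the wide dataset from Exercise 2, and the object names `pca_rec` and `pca_prep` are illustrative only.

```r
library(dplyr)
library(recipes)

# Hypothetical wide table: one row per drink, one column per ingredient
toy_wide <- tibble(
  name = paste("Drink", 1:6),
  category = rep(c("Cat A", "Cat B"), 3),
  gin = c(2, 1.5, 0, 0, 1, 0.5),
  lemon_juice = c(0.5, 0, 1, 0.75, 0, 0.25),
  light_rum = c(0, 0, 2, 1.5, 0.5, 0.5)
)

# No outcome on the left of the formula; name and category become ids
pca_rec <- recipe(~ ., data = toy_wide) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# Estimate the normalization statistics and the PCA loadings
pca_prep <- prep(pca_rec)

# Count the variables that still carry the predictor role
n_predictors <- summary(pca_rec) %>%
  filter(role == "predictor") %>%
  nrow()
n_predictors
```

Printing `pca_prep` also lists the variables by role, which is another way to read off the predictor count.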
Exercise 4
Apply the recipes::tidy verb to the prepped recipe in the last exercise. The result is a table identifying the information generated and stored by each step in the recipe from the input data.
To see the values calculated for normalization, apply the recipes::tidy verb as before, but with the second argument (number) set to 1.
What ingredient is the most used, on average?
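As an illustration, the sketch below tidies a prepped recipe and then pulls out the normalization statistics. It rebuilds a hypothetical toy recipe so that it runs on its own; the data and object names are assumptions, not the lab data.

```r
library(dplyr)
library(recipes)

# Hypothetical wide table standing in for the Exercise 2 result
toy_wide <- tibble(
  name = paste("Drink", 1:6),
  category = rep(c("Cat A", "Cat B"), 3),
  gin = c(2, 1.5, 0, 0, 1, 0.5),
  lemon_juice = c(0.5, 0, 1, 0.75, 0, 0.25),
  light_rum = c(0, 0, 2, 1.5, 0.5, 0.5)
)

pca_prep <- recipe(~ ., data = toy_wide) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors()) %>%
  prep()

# One row per step in the recipe
step_overview <- tidy(pca_prep)

# Step 1 is the normalization: a mean and an sd for each predictor
norm_stats <- tidy(pca_prep, number = 1)

# Ingredients ranked by their average measure
most_used <- norm_stats %>%
  filter(statistic == "mean") %>%
  arrange(desc(value))
most_used
```

The first row of `most_used` is the ingredient with the largest mean in the toy table.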
Exercise 5
Now look at the result of the PCA, applying the recipes::tidy verb as before, but with the second argument (number) set to 2. Save the result in a variable and filter for the components PC1 to PC5. Mutate the resulting component column so that the values are factors, ordered as they appear, using the forcats::fct_inorder verb.
Plot this data using ggplot2 and the code below:
ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7),
        axis.title = element_text(size = 14, face = "bold"))
How would you describe the drinks represented by PC1?
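One way to connect the tidied PCA step to the plotting code is sketched here. The toy data is hypothetical and has only three predictors, so only PC1 to PC3 exist; the filter for PC1 to PC5 is kept to mirror the exercise.

```r
library(dplyr)
library(recipes)
library(ggplot2)

# Hypothetical wide table standing in for the Exercise 2 result
toy_wide <- tibble(
  name = paste("Drink", 1:6),
  category = rep(c("Cat A", "Cat B"), 3),
  gin = c(2, 1.5, 0, 0, 1, 0.5),
  lemon_juice = c(0.5, 0, 1, 0.75, 0, 0.25),
  light_rum = c(0, 0, 2, 1.5, 0.5, 0.5)
)

pca_prep <- recipe(~ ., data = toy_wide) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors()) %>%
  prep()

# Step 2 is the PCA: one loading per ingredient and component
tidied_pca <- tidy(pca_prep, number = 2) %>%
  filter(component %in% paste0("PC", 1:5)) %>%
  mutate(component = forcats::fct_inorder(component))

# Pipe the tidied loadings into the plotting code from the exercise
pc_plot <- tidied_pca %>%
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7),
        axis.title = element_text(size = 14, face = "bold"))
pc_plot
```

Bars pointing in opposite directions within a facet mark ingredients that the component contrasts with one another.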
Exercise 6
As in the last exercise, use the variable with the tidied PCA data, keeping only components PC1 to PC4. Slice the top 8 ingredients by component, ordered by their absolute value, using the verb dplyr::slice_max. Next, generate a grouped table using gt::gt, colouring the cell backgrounds (i.e. fill) green for values \(\ge0\) and red for values \(<0\).
What is the characteristic alcoholic beverage of each of the first 4 principal components?
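The slicing and table styling could be sketched as follows. The toy data is hypothetical and small, so the sketch keeps the top 2 loadings per component rather than 8, and it uses gt::tab_style with a row condition as one possible way to colour the cells.

```r
library(dplyr)
library(recipes)
library(gt)

# Hypothetical wide table standing in for the Exercise 2 result
toy_wide <- tibble(
  name = paste("Drink", 1:6),
  category = rep(c("Cat A", "Cat B"), 3),
  gin = c(2, 1.5, 0, 0, 1, 0.5),
  lemon_juice = c(0.5, 0, 1, 0.75, 0, 0.25),
  light_rum = c(0, 0, 2, 1.5, 0.5, 0.5)
)

tidied_pca <- recipe(~ ., data = toy_wide) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors()) %>%
  prep() %>%
  tidy(number = 2)

# Largest loadings by absolute value within each component
# (n = 2 here because the toy data has only 3 ingredients; the lab uses 8)
top_terms <- tidied_pca %>%
  filter(component %in% paste0("PC", 1:4)) %>%
  group_by(component) %>%
  slice_max(abs(value), n = 2, with_ties = FALSE) %>%
  ungroup() %>%
  select(component, terms, value)

# Grouped table: green fill for non-negative loadings, red for negative
top_table <- top_terms %>%
  gt(groupname_col = "component") %>%
  tab_style(
    style = cell_fill(color = "green"),
    locations = cells_body(columns = value, rows = value >= 0)
  ) %>%
  tab_style(
    style = cell_fill(color = "red"),
    locations = cells_body(columns = value, rows = value < 0)
  )
top_table
```

Within each component group, the sign and size of the loadings show which ingredients dominate that component.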
Exercise 7
For this exercise, bake the prepped PCA recipe using recipes::bake on the original data and plot each cocktail by its PC1 and PC2 components, using
ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)
Can you offer an interpretation of the PCA results?
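Baking and plotting can be sketched like this, again on a hypothetical toy table rather than the lab data. The id columns name and category pass through bake unchanged, which is what lets the plot label and colour each drink.

```r
library(dplyr)
library(recipes)
library(ggplot2)

# Hypothetical wide table standing in for the Exercise 2 result
toy_wide <- tibble(
  name = paste("Drink", 1:6),
  category = rep(c("Cat A", "Cat B"), 3),
  gin = c(2, 1.5, 0, 0, 1, 0.5),
  lemon_juice = c(0.5, 0, 1, 0.75, 0, 0.25),
  light_rum = c(0, 0, 2, 1.5, 0.5, 0.5)
)

pca_prep <- recipe(~ ., data = toy_wide) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors()) %>%
  prep()

# Apply the prepped steps: ingredient columns are replaced by PC scores
baked <- bake(pca_prep, new_data = toy_wide)

# Pipe the scores into the plotting code from the exercise
pca_scatter <- baked %>%
  ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)
pca_scatter
```

Drinks that sit close together in this plot have similar ingredient profiles along the first two components.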
Grading
Total points available: 30 points.
| Component | Points |
|---|---|
| Ex 1 - 7 | 30 |